Importing Libraries

Phase 2: Provisioning

Importing Dataset

Using glob, we scan a local folder and collect every file in it. Using a regular expression, we filter out any file that isn't a .csv. We then name the files appropriately, read each one, and assign it to the corresponding variable.
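
The loading step described above could look roughly like the following. This is a minimal sketch, not the notebook's actual code: the helper name `load_csv_folder` is hypothetical, and it returns a dictionary keyed by file name instead of creating one variable per file.

```python
import glob
import os
import re

import pandas as pd

# Matches file names that end in ".csv" (case-insensitive).
CSV_PATTERN = re.compile(r"\.csv$", re.IGNORECASE)


def load_csv_folder(folder):
    """Return {file name without extension: DataFrame} for each .csv in `folder`.

    Hypothetical helper for illustration only.
    """
    frames = {}
    for path in sorted(glob.glob(os.path.join(folder, "*"))):
        # Skip sub-folders and anything that is not a .csv file.
        if not os.path.isfile(path) or not CSV_PATTERN.search(path):
            continue
        name = os.path.splitext(os.path.basename(path))[0]
        frames[name] = pd.read_csv(path)
    return frames
```

Keeping the dataframes in a dictionary makes it easy to loop over all devices later on.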

Converting our dataframes to Time Series

We convert the EventDt column in each available dataframe to DateTime format, set it as the index, and then resample the data at a 5-minute frequency. Additionally, we add daily, weekly and monthly variants of each dataframe, which assist us in the data analysis later in this notebook.
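
A sketch of this conversion is shown below. The `EventDt` column comes from the data; the value column name `Value`, the helper name `to_time_series`, and the choice of the mean as the aggregation are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def to_time_series(df, value_col, time_col="EventDt", freq="5min"):
    """Index `df` by time and resample `value_col` to `freq` using the mean."""
    out = df.copy()
    out[time_col] = pd.to_datetime(out[time_col])
    out = out.set_index(time_col).sort_index()
    return out[[value_col]].resample(freq).mean()


# Synthetic readings every 2.5 minutes (hypothetical data).
stamps = pd.Timestamp("2021-01-01") + pd.to_timedelta(np.arange(12) * 150, unit="s")
raw = pd.DataFrame({
    "EventDt": stamps.strftime("%Y-%m-%d %H:%M:%S"),
    "Value": np.arange(12, dtype=float),
})

five_min = to_time_series(raw, "Value")
daily = five_min.resample("D").mean()
weekly = five_min.resample("W").mean()
monthly = five_min.resample("MS").mean()  # "MS" = bins labelled by month start
```

Bins without any reading come out as NaN after resampling, which is what the later missing-data checks count.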

Visualisation

Before going any further with our research, we wanted to see what each device was recording and whether it was being read correctly.

As can be observed, the combined plot of all sensors is cluttered, making it difficult to understand what is going on. To start the analysis, we first compute the average temperature for each device to see where it falls on the scale from freezing to warm.

Average Temperature Dictionary

To cut down on load times, instead of plotting each dataframe's values directly, we create a dictionary that maps each device name to its average temperature. This dictionary is later plotted using a Seaborn barplot.

We chose to order the devices from the coldest to the warmest average temperature, to see where each one stands and to categorize it accordingly.

After plotting the dictionary we can easily see the average temperature for each dataframe and compare them to one another. This will help us with categorizing and labeling our dataframes.
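
As an illustration of the dictionary described above, the sketch below uses three made-up dataframes; the device readings and the `Value` column name are assumptions, not the real data.

```python
import pandas as pd

# Hypothetical stand-ins for the per-device dataframes.
frames = {
    "A": pd.DataFrame({"Value": [4.0, 5.0, 6.0]}),
    "K": pd.DataFrame({"Value": [-21.0, -19.0]}),
    "I": pd.DataFrame({"Value": [17.0, 19.0]}),
}

# Device name -> average temperature.
averages = {name: float(df["Value"].mean()) for name, df in frames.items()}

# Order from coldest to warmest so the bar plot reads left to right.
averages_sorted = dict(sorted(averages.items(), key=lambda item: item[1]))

# The sorted dictionary can then be passed to a bar plot, for example:
#   seaborn.barplot(x=list(averages_sorted), y=list(averages_sorted.values()))
```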

Categorizing Dataframes

After careful consideration and discussion, we decided to separate our dataframes into three distinct categories: Freezer, Fridge, and Pantry.

Before moving further with the research, note that each device can be described by a few attributes, including its category and customer type; the next step focuses on these.

As can be seen, each device was assigned to one of three groups:
- Freezer: devices with temperature lower than 0.
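
The assignment can be sketched as a simple threshold rule on the average temperature. Only the freezer rule (below 0) is stated in the text; the boundary between Fridge and Pantry (`FRIDGE_MAX = 8.0`) is a placeholder assumption, as are the function name and the sample averages.

```python
# ASSUMPTION: hypothetical boundary between Fridge and Pantry, not taken from the notebook.
FRIDGE_MAX = 8.0


def categorize(avg_temp, fridge_max=FRIDGE_MAX):
    """Assign a category based on a device's average temperature."""
    if avg_temp < 0:  # rule stated in the text
        return "Freezer"
    if avg_temp <= fridge_max:  # assumed boundary
        return "Fridge"
    return "Pantry"


# Made-up averages for three devices.
sample_averages = {"K": -20.0, "S": 4.5, "I": 18.0}
categories = {name: categorize(avg) for name, avg in sample_averages.items()}
```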

Because the data does not contain any labels, this categorization is a judgment made by the group members with the help of external resources.

Following the categorization of the devices, we want to observe each device's pattern through a line plot, which will allow us to spot potential anomalies and then select a few devices to keep the research manageable.

Analysis of Elimination

Here we plot every dataframe, divided by category, and from these we pick the most appropriate dataset to train our model on. For a dataframe to be deemed appropriate for further analysis, it needs to fulfill the following conditions:

First Elimination per Category (Line Plots)

Elimination is aided by a series of standard plots: line plots to assess consistency, followed by box plots to identify outliers, and finally the selection of one device per group.

Freezer Category

As can be seen, there are 5 devices in the freezer category. All of them show only a small number of anomalies and appear to be consistent in their patterns. However, analysing all of them would be impractical, so to limit our research we leave out devices E and Y, whose unrecognized patterns could complicate the modeling.

Refrigerator Category

Compared to the freezer category, the fridge category contains more devices and thus more candidates for analysis, but that number is also a complication, since it could slow down the research and modeling. For the next steps, we stick with devices B, S, and G because of their lack of anomalies and their consistent patterns.

Pantry Category

In the pantry category, we can see that the majority of the devices ended up in this group, which could be due to the temperature range we defined for it. As with the freezer and fridge categories, the line plots allowed us to see the patterns and narrow down the devices so that we could continue with our analysis phase.

Rest of the categories: Incubator, Production Area and Storage

Second Elimination per Category (Box Plots)

After manually selecting three devices from each category, we are ready to dig further into them. We move on to a second round of elimination using box plots, where we can see the outliers of each device and eliminate the devices that have the most of them.

Freezers

Based on the results, we choose device K since it has fewer outliers than devices AA and H, both of which show a large number of easily spotted outliers that would slow down our research and harm further modeling. As a consequence, we continue with device K for further study.

Refrigerators

In the case of the refrigerators, device B is eliminated due to its outliers, and device G is eliminated because its outliers accumulate between 4 and 5 degrees. As a result, we stick with device S, even though it has a few outliers of its own, because it is more useful for further research and modeling than the other two.

Pantries

Following the same approach as before, we stick with device I as our main device for the pantry category, because devices D and AB show more outliers. Models trained on data with outliers tend to have a higher error variance, and outliers reduce the power of statistical tests, so device I is the safer choice.

Rest of the categories: Incubator, Production Area and Storage

Following the same approach as before: although device AC does not appear to have any outliers, it would require additional supporting data before we could proceed with it, so the next choice is device D, which has the second fewest outliers.

Final Choice

Based on the analysis that we've done, we decide to use Device K for the freezer, Device S for the refrigerator, Device I for the pantry, and lastly Device D for storage. All of these devices are carried into the following step, which digs deeper into them in terms of missing data, a general seasonality overview, limits, and so on, and prepares them for modeling.

Analysis for Device K (Freezer)

Device K has a total of 59 missing values, which we locate and examine with the visuals that follow. The next steps also aid in the detection of outliers.

Seasonality Overview

In the figure below, the first plot shows the outliers of device K in each month, the second plot shows the overview for the entire week, and the third plot shows the overview for the whole day (24 hours).

The seasonal overview shows the temperature spikes over a week's time frame, followed by the hourly fluctuations. Device K shows some ups and downs in temperature, which is not a good sign and calls for further study.
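
The three views (per month, per weekday, per hour) all rest on the same idea: grouping the readings by a component of their timestamp. Below is a minimal sketch on synthetic data; the series and its daily cycle are made up, and the notebook's own plots may be built differently.

```python
import numpy as np
import pandas as pd

# Synthetic hourly readings over 60 days with a daily cycle around -20 degrees.
index = pd.Timestamp("2021-01-01") + pd.to_timedelta(np.arange(24 * 60), unit="h")
values = -20.0 + np.sin(2 * np.pi * index.hour / 24)
series = pd.Series(values, index=index, name="Value")

# Average temperature per calendar month, per weekday (0 = Monday) and per hour.
by_month = series.groupby(series.index.month).mean()
by_weekday = series.groupby(series.index.dayofweek).mean()
by_hour = series.groupby(series.index.hour).mean()
```

The same grouping keys can be used as the x-axis of box plots to show the spread and outliers within each month, weekday, or hour.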

Monthly Examination

This shows the temperature across the months in order to track the device's behaviour over time.

In the preceding figure, the day-by-day temperature over time can be seen. The device is mostly stable, with a few black patches that indicate spikes and light grey marking the start and end of the provided dataset. This is a promising result, but it is worth taking a closer look, which the next visualisation provides.

Upper and Lower Limits

As we can see from the plot below, the upper limit for Device K is -20 degrees and the lower limit is below -20 degrees.

As can be observed, there are a few spikes that exceed the device's upper limit, which is an issue that has to be resolved. This is where the additional steps become useful: they locate and resolve such outliers and prepare device K for further work.
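
Readings that exceed the upper limit can be picked out with a simple comparison. The sketch below uses the -20 degree upper limit mentioned above; the readings themselves are made up.

```python
import pandas as pd

UPPER_LIMIT = -20.0  # upper limit for device K, as stated in the text

# Hypothetical daily readings.
series = pd.Series(
    [-22.0, -21.5, -19.0, -23.0, -18.5],
    index=pd.date_range("2021-01-01", periods=5, freq="D"),
    name="Value",
)

# Readings warmer than the upper limit, and their share of all readings.
above = series[series > UPPER_LIMIT]
share_above = len(above) / len(series)
```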

Missing Data

Here we check for missing values in device K.

Since device K has 59 missing values, we can use the outlier analysis to determine where they are.
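
Counting and locating missing values might be done as in the sketch below. The series is synthetic, and grouping consecutive gaps is an extra illustration, not a step taken from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic 5-minute series with gaps (NaN marks a missing reading).
index = pd.Timestamp("2021-01-01") + pd.to_timedelta(np.arange(6) * 5, unit="min")
series = pd.Series([-21.0, np.nan, np.nan, -20.5, -21.2, np.nan], index=index)

is_missing = series.isna()
n_missing = int(is_missing.sum())       # how many values are missing
missing_times = series.index[is_missing]  # when they are missing

# Label each run of equal flags, then keep the lengths of the missing runs.
run_id = (is_missing != is_missing.shift()).cumsum()
gap_lengths = is_missing.groupby(run_id).sum()
gap_lengths = gap_lengths[gap_lengths > 0]
```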

Outliers Analysis

With that in mind, let's look at the outliers using a density plot and a box plot. Outliers are data points that are markedly different from the rest. On the boxplot, this will be clearly visible.

As can be seen for device K, there is a strong bias toward negative temperatures, which is understandable given that it is a freezer; nevertheless, it is also clear that it has a significant number of outliers, which will be resolved in a subsequent stage using IQR.

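IQR stands for inter-quartile range. Under the IQR proximity rule, a data point is conventionally treated as an outlier if it lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 and Q3 are the dataset's 25th and 75th percentiles, respectively, and IQR is the inter-quartile range, calculated as Q3 - Q1.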
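
A minimal sketch of trimming with the IQR rule follows. The multiplier of 1.5 is the conventional choice and an assumption here, as are the helper name `iqr_bounds` and the sample values.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) bounds of the IQR proximity rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up freezer readings with one obvious outlier (-5.0).
series = pd.Series([-21.0, -20.5, -20.0, -19.5, -19.0, -5.0])

lower, upper = iqr_bounds(series)

# Trimming: drop every value outside the bounds.
trimmed = series[(series >= lower) & (series <= upper)]
```

Trimming removes the rows entirely, so the trimmed series is shorter than the original.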

As can be seen, the range is fairly narrow, with the majority of the values being negative, as expected for a freezer. Let's look at the same graphs with the trimmed data to see the difference.

While the density plot is now more consistent, the boxplot still shows some outliers. To deal with them, we use the capping approach, which limits the remaining extreme values to the boundaries.
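
Capping can be sketched with `Series.clip`, which replaces values beyond the bounds with the bounds themselves instead of dropping them. The bounds below reuse the IQR rule with the conventional multiplier of 1.5, which is an assumption; the sample values are made up.

```python
import pandas as pd

# Made-up freezer readings with one obvious outlier (-5.0).
series = pd.Series([-21.0, -20.5, -20.0, -19.5, -19.0, -5.0])

q1, q3 = series.quantile(0.25), series.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Capping: values outside [lower, upper] are set to the nearest bound.
capped = series.clip(lower=lower, upper=upper)
```

Unlike trimming, capping keeps the length of the series unchanged, which matters when the time index must stay regular.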

The distribution of temperature values is clear now that the density and box plots have changed, but in order to keep these limits in later steps, the outliers must be identified by clustering and then removed from the dataset.

Clustering The Outliers

Removing the Outliers

Analysis for Device S (Refrigerators)

Missing Data

Before going deeper into the device's initial analysis, it's a good idea to check for any missing data first. This comes down to the same steps as for the previous device.

As can be seen in this analysis, device S contains up to 7 missing values out of a total of 315.648, which is much lower than for the previous device.

Seasonality Overview

In the figure below, the first plot shows the outliers of device S in each month, the second plot shows the overview for the entire week, and the third plot shows the overview for the whole day (24 hours).

Upper and Lower Limits

The upper and lower limits are taken from another dataset. For device S, the upper limit is 25 and the lower limit is 8; the visual below shows these limits along with the device's consistency over time.

Seasonal Decomposition

In the plot below we can see the actual data, the trend of the data, whether there is seasonality in the data, and lastly whether there is any residual left in our data.
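
To make the four panels concrete, here is a hand-rolled sketch of a classical additive decomposition (observed = trend + seasonal + residual). It is only an illustration on synthetic data; the notebook itself may well rely on a library routine such as `seasonal_decompose` from statsmodels. The function name, the weekly period, and the data are assumptions.

```python
import numpy as np
import pandas as pd


def additive_decompose(series, period):
    """Split `series` into trend, seasonal and residual parts (odd `period` only)."""
    # Trend: centred moving average over one full period.
    trend = series.rolling(window=period, center=True).mean()
    detrended = series - trend
    # Seasonal: average detrended value at each position within the period.
    phase = np.arange(len(series)) % period
    seasonal_means = detrended.groupby(phase).mean()
    seasonal_means = seasonal_means - seasonal_means.mean()
    seasonal = pd.Series(seasonal_means.loc[phase].to_numpy(), index=series.index)
    # Residual: whatever trend and seasonality do not explain.
    resid = series - trend - seasonal
    return trend, seasonal, resid


# Synthetic daily data: slow upward drift plus a weekly pattern that sums to zero.
pattern = np.array([0.3, 0.1, -0.2, -0.4, 0.0, 0.1, 0.1])
t = np.arange(70)
index = pd.date_range("2021-01-01", periods=70, freq="D")
observed = pd.Series(4.0 + 0.01 * t + pattern[t % 7], index=index)

trend, seasonal, resid = additive_decompose(observed, period=7)
```

On this noise-free example the residual is essentially zero; on real sensor data, large residuals point at readings that neither the trend nor the seasonality explains.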

Outliers Analysis

With that in mind, let's look at the outliers using a density plot and a box plot. Outliers are data points that are markedly different from the rest. On the boxplot, this will be clearly visible.

Based on the samples, the values fall approximately in the range of 3 to 6. It is also useful to visualize the data further and compare the findings with the broad visualisation above.

The trimmed data shows a consistent density plot with some suspicious spikes in the middle; the box plot, however, is clean and within the outlier limits. Let's look at the data to see if anything is still missing.

Clustering the Outliers

Removing the Outliers

Analysis for Device I (Pantry)

Missing Data

Seasonality Overview

In the figure below, the first plot shows the outliers of device I in each month, the second plot shows the overview for the entire week, and the third plot shows the overview for the whole day (24 hours).

Upper and Lower Limits

The upper and lower limits are taken from another dataset. For device I, the upper limit is 25 and the lower limit is 8; the visual below shows these limits along with the device's consistency over time.

Seasonal Decomposition

In the plot below we can see the actual data, the trend of the data, whether there is seasonality in the data, and lastly whether there is any residual left in our data.

Outliers Analysis

According to the density plot, the variable has two large peaks in its distribution, and the box plot shows outliers on both sides; to clarify this, the IQR method is used.

It's useful to compare the same plots for the trimmed dataset after elimination.

After trimming, the remaining outliers are clearly concentrated at the upper limit of the data. To eliminate them and also include the lower limit, it is useful to apply capping and compare the visuals in the subsequent steps.

After capping, the boundaries where the outliers were eliminated are evident, and the density of the variable is distributed in a range of 15 to 20. Before continuing, let's check the missing values.

Clustering the Outliers

Removing the Outliers

Analysis for Device D (Storage)

Missing Data

Seasonality Overview

Seasonal Decomposition

In the plot below we can see the actual data, the trend of the data, whether there is seasonality in the data, and lastly whether there is any residual left in our data.

Outliers Analysis

The density plot is tightly concentrated, but the boxplot makes it clear that this device has a lot of outliers. The next step is to evaluate the boundaries to help limit them.

After trimming, it is important to visualize the upper limit to see the difference from the previous step before proceeding.

Although there are still some outliers to account for, the density clearly leans towards the left side and decreases on the right; capping is useful to include the lower limit as well.

The density is clear and strong, with roughly equal skew on both sides, and the box plot also shows that the outliers have been eliminated, with both graphs in the same temperature range for storage. It is also important to check for missing values before moving on to seasonality.

Clustering the Outliers

Removing the Outliers

Modeling

Device I NeuralProphet prediction

Device K Predictions

Seasonal Arima model. Daily value prediction

Seasonal Auto-Arima model. Daily value prediction

Seasonal Auto-Arima model. Hourly prediction

NeuralProphet model. Daily prediction

NeuralProphet model. All data prediction

Device S predictions

Seasonal Arima model. Daily value prediction

Seasonal Auto-Arima model. Daily value prediction

NeuralProphet model. Daily Prediction

Device D predictions

Seasonal Auto-Arima model. Daily value prediction

NeuralProphet model. Daily prediction

NeuralProphet model. All data prediction